Introduction to Web Scraping and Data Management for Social Scientists

Session 2: Introduction to the Web

Johannes B. Gruber

2023-07-25

Introduction

The Plan for Today

In this session, we learn how to scout data in the wild. We will:

  • discuss web scraping from a theoretical point of view:
    • What is web scraping?
    • Why should you learn it?
    • What legal and ethical implications should you keep in mind?
  • learn a bit more about how the Internet works
    • What is HTML?
    • What is CSS?

Angie Gade via unsplash.com

What is Web Scraping

Ways to get data from the web

  • Download data from a website
  • Retrieve data via an API
  • Scrape the (unstructured) data

Image Source: daveberesford.co.uk

Web Scraping

  • A technique to extract specific data from a web page
    • Get the author, title, date and body of an online news article
    • A table from a website (like Wikipedia)
  • This can include hyperlinks which you can systematically go through
  • You can write a “robot” that systematically extracts data and saves it in a format convenient for analysis
    • All news about a specific topic from a news site
    • All press releases from a political party website
    • All posts on a web forum about a specific topic
    • All participants of an event from the event’s website
  • A web-scraper is a program that goes to web pages, downloads the contents, extracts data out of the contents and then saves the data to a file or a database
  • Unfortunately, there is no one-size-fits-all solution
    • Lots of different techniques, tools, tricks
    • Websites change (some more frequently than others)
    • Some websites make it hard for you (by accident or on purpose!)

Web Scraping: A Three-Step Process

  1. Send an HTTP request to the webpage -> server responds to the request by returning HTML content
  2. Parse the HTML content -> extract the information you want from the nested structure of HTML code
  3. Wrangle the data into a useful format
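
The three steps can be sketched in R with rvest. This is a minimal sketch, not a definitive implementation: to keep it runnable offline, a literal HTML string stands in for the server's response (on a real site, step 1 would be read_html() called on a URL), and the class names article and date are invented for illustration.

```r
library(rvest)

# Step 1 (stand-in): on a real site this would be read_html("https://...");
# here a literal HTML string plays the role of the server's response
response <- "<html><body>
  <div class='article'><h2>First title</h2><span class='date'>2023-07-24</span></div>
  <div class='article'><h2>Second title</h2><span class='date'>2023-07-25</span></div>
</body></html>"

# Step 2: parse the HTML and extract the pieces of interest
page <- read_html(response)
titles <- page |> html_elements(".article h2") |> html_text()
dates <- page |> html_elements(".article .date") |> html_text()

# Step 3: wrangle the extracted strings into a rectangular format
articles <- data.frame(title = titles, date = as.Date(dates))
articles
```

The result is an ordinary data frame with one row per article, which can be saved or analysed like any other data set.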

Original Image Source: prowebscraper.com

Hurdles

Some web pages are easier to scrape than others:

  1. Well-behaved static HTML with recurring patterns
  2. Haphazard HTML not clearly differentiating between different types of information
  3. Interactive web sites loading content by executing code in your browser (usually JavaScript)
  4. Interactive web sites with mechanisms against data extraction (rate limits, captchas etc.)

Why Should You Learn Web Scraping?

  • The internet is a data gold mine!
  • The data were not created for research, but are often traces of what people are actually doing on the internet
  • Reproducible and renewable data collection (e.g., rehydrate data that is copyrighted)
  • Web scraping lets you automate data retrieval (as opposed to tedious copy & paste from a web site)
  • It’s one of the most fun ways to learn R and programming!
    • It’s engaging and satisfying to find repeating patterns that you can employ to structure data (every website becomes a little puzzle)
    • It touches on many important computational skills
    • The return is good data to further your career (unlike sudokus or video games)

@realDonaldTrump Twitter usage Clarke and Grieve (2019)

Implications of Web Scraping

ToS and Robots.txt

Twitter ToS

User-agent: *                         # the rules apply to all user agents
Disallow: /EPiServer/CMS/             # do not crawl any URLs that start with /EPiServer/CMS/
Disallow: /Util/                      # do not crawl any URLs that start with /Util/ 
Disallow: /about/art-in-parliament/   # do not crawl any URLs that start with /about/art-in-parliament/

https://www.parliament.uk/robots.txt
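
How such rules translate into a check before scraping can be sketched in base R. The helper is_allowed() is hypothetical and deliberately simplified: it only looks at Disallow lines and treats them as path prefixes, ignoring Allow lines, wildcards and sections for specific user agents. For real projects, a dedicated package such as robotstxt is probably the safer choice.

```r
# Rules copied from the excerpt above (comments removed)
robots_txt <- c(
  "User-agent: *",
  "Disallow: /EPiServer/CMS/",
  "Disallow: /Util/",
  "Disallow: /about/art-in-parliament/"
)

# Hypothetical helper: TRUE if `path` does not start with any disallowed prefix.
# Simplification: ignores Allow lines, wildcards and per-agent sections.
is_allowed <- function(path, rules) {
  disallow_lines <- grep("^Disallow:", rules, value = TRUE)
  disallowed <- trimws(sub("^Disallow:", "", disallow_lines))
  disallowed <- disallowed[nzchar(disallowed)]  # an empty Disallow blocks nothing
  !any(startsWith(path, disallowed))
}

util_ok <- is_allowed("/Util/login", robots_txt)      # falls under /Util/
news_ok <- is_allowed("/business/news/", robots_txt)  # matches no rule
```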

Ethical

  • Are there other means available to get to the data (e.g., via an API)?
  • robots.txt might not be legally binding, but it is not nice to ignore it
  • Scraping can put a heavy load on a website (if you make thousands of requests), which costs the hosts money and might bring down the site (with an effect similar to a denial-of-service attack)
  • Think twice before scraping personal data. You should ask yourself:
    • is it necessary for your research?
    • are you harming anyone by obtaining (or distributing) the data?
    • do you really need everything or are parts of the data sufficient (e.g., can you preselect cases or ignore variables)
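
One simple way to reduce the load on a site is to pause between requests. The following is a minimal sketch under stated assumptions: scrape_politely() is a hypothetical helper, the example.com URLs are placeholders, and a stand-in function replaces the actual download so that no real requests are made. In practice, fetch could be read_html().

```r
# Hypothetical helper: call `fetch` on each URL, pausing `delay` seconds in between
scrape_politely <- function(urls, fetch, delay = 1) {
  results <- vector("list", length(urls))
  for (i in seq_along(urls)) {
    results[[i]] <- fetch(urls[i])
    # wait before the next request; no need to wait after the last one
    if (i < length(urls)) Sys.sleep(delay)
  }
  results
}

# Usage with a stand-in fetch function (placeholder URLs, no real requests)
urls <- c("https://example.com/page1", "https://example.com/page2")
pages <- scrape_politely(
  urls,
  fetch = function(u) paste("content of", u),
  delay = 0.5
)
```

A suitable delay depends on the site; a longer pause makes the collection slower but is kinder to the host.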

Advice?

Legal and ethical advice is rare and complicated to give. A good opinion piece about it is Freelon (2018). It is worth reading, but can be summarised in three general pieces of advice:

  • use authorized methods whenever possible
  • do not confuse terms of service compliance with data protection
  • understand the risks of violating terms of service

Exercises 1

Twitter recently made access to their API punishingly expensive and stopped free academic access for research. If you wanted to do research on Twitter data through web scraping anyway, what implications would that have:

  1. Legally

  2. Ethically

  3. Practically

What are HTML and CSS

What is HTML

  • HTML (HyperText Markup Language) is the standard markup language for documents designed to be displayed in a web browser
  • It contains the raw data (text, URLs to pictures and videos) and defines the structure and some of the styling of the text

Image Source: Wikipedia.org

Example: Simple

Code:

<!DOCTYPE html>
<html>
<head>
    <title>My Simple HTML Page</title>
</head>
<body>
    <p>This is the body of the text.</p>
</body>
</html>

Browser View:

Example: With headline and author

Code:

<!DOCTYPE html>
<html>
<head>
    <title>My Simple HTML Page</title>
</head>
<body>
    <h1>My Headline</h1>
    <p class="author"><a href="https://www.johannesbgruber.eu/">Me</a></p>
    <p>This is the body of the text.</p>
</body>
</html>

Browser View:

Example: With some data

Code:

<!DOCTYPE html>
<html>
<head>
    <title>My Simple HTML Page</title>
</head>
<body>
    <h1>My Headline</h1>
    <p class="author">Me</p>
    <p>This is the body of the text.</p>
    <p>Consider this data:</p>
    <table>
        <tr>
            <th>Name</th>
            <th>Age</th>
        </tr>
        <tr>
            <td>John</td>
            <td>25</td>
        </tr>
        <tr>
            <td>Mary</td>
            <td>26</td>
        </tr>
    </table>
</body>
</html>

Browser View:

Example: With an image

Code:

<!DOCTYPE html>
<html>
<head>
    <title>My Simple HTML Page</title>
</head>
<body>
    <h1>My Headline</h1>
    <p class="author">Me</p>
    <p>This is the body of the text.</p>
    <p>Consider this image:</p>
    <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/0/0c/About_The_Dog.jpg/640px-About_The_Dog.jpg" alt="About The Dog.">
</body>
</html>

Browser View:

What is CSS

  • CSS (Cascading Style Sheets) is very often used in addition to HTML to control the presentation of a document
  • Designed to enable the separation of content from things concerning the look, such as layout, colours, and fonts.
  • The reason it is interesting for web scraping is that the same kind of information often gets the same styling (and therefore the same class or id), which you can use to find it

Example: CSS

HTML:

<!DOCTYPE html>
<html>
<head>
    <title>My Simple HTML Page</title>
    <link rel="stylesheet" type="text/css" href="example.css">
</head>
<body>
  <h1 class="headline">My Headline</h1>
  <p class="author">Me</p>
  <div class="content">
    <p>This is the body of the text.</p>
    <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/0/0c/About_The_Dog.jpg/640px-About_The_Dog.jpg" alt="About The Dog.">
    <p>Consider this data:</p>
    <table>
      <tr class="top-row">
          <th>Name</th>
          <th>Age</th>
      </tr>
      <tr>
          <td>John</td>
          <td>25</td>
      </tr>
      <tr>
          <td>Mary</td>
          <td>26</td>
      </tr>
    </table>
  </div>
</body>
</html>

CSS:

/* CSS file */

.headline {
  color: red;
}

.author {
  color: grey;
  font-style: italic;
  font-weight: bold;
}

.top-row {
  background-color: lightgrey;
}

.content img {
  border: 2px solid black;
}

table, th, td {
  border: 1px solid black;
}

Browser View:

HTML and CSS in Web Scraping: a preview

Using HTML tags:

You can select HTML elements by their tags

library(rvest)
read_html("data/example.html") |> 
  html_elements("p") |> 
  html_text()
[1] "\n    Me\n  "                  "This is the body of the text."
[3] "Consider this image:"          "Consider this data:"          
  • to select them, tags are written without the <>
  • in theory, arbitrary tags are possible, but commonly people use <p> (paragraph), <br> (line break), <h1>, <h2>, <h3>, … (first, second, third, … level headline), <b> (bold), <i> (italic), <img> (image), <a> (hyperlink, with the target in its href attribute), and a couple more.
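
To see a few of these tags in action, here is a small self-contained sketch. It uses an inline HTML string invented for illustration rather than data/example.html, and the file name dog.jpg is a placeholder.

```r
library(rvest)

html <- read_html("<html><body>
  <h1>My Headline</h1>
  <p>This is the body of the text.</p>
  <p>Read more <a href='https://en.wikipedia.org/wiki/Dog'>about dogs</a>.</p>
  <img src='dog.jpg' alt='A dog'>
</body></html>")

# the headline is the <h1> tag
headline <- html |> html_element("h1") |> html_text()

# hyperlinks are <a> tags; the target is stored in the href attribute
link_targets <- html |> html_elements("a") |> html_attr("href")

# images are <img> tags; the file location is stored in the src attribute
image_sources <- html |> html_elements("img") |> html_attr("src")
```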

Using attributes

You can select elements by an attribute, including the class:

read_html("data/example.html") |> 
  html_element("[class=\"headline\"]") |> 
  html_text()
[1] "My Headline"

For class, there is also a shorthand:

read_html("data/example.html") |> 
  html_element(".headline") |> 
  html_text()
[1] "My Headline"

Another important shorthand is #, which selects elements by their id attribute:

read_html("data/example.html") |> 
  html_element("#table-1") |> 
  html_table()
# A tibble: 2 × 2
  Name    Age
  <chr> <int>
1 John     25
2 Mary     26
read_html("data/example.html") |> 
  html_element("#table-1 > tr")
{html_node}
<tr class="top-row">
[1] <th>Name</th>
[2] <th>Age</th>

Extracting attributes

Instead of selecting by attribute, you can also extract one or all attributes:

read_html("data/example.html") |> 
  html_elements("a") |> 
  html_attr("href")
[1] "https://www.johannesbgruber.eu/"   "https://en.wikipedia.org/wiki/Dog"
read_html("data/example.html") |> 
  html_elements("a") |> 
  html_attrs()
[[1]]
                             href 
"https://www.johannesbgruber.eu/" 

[[2]]
                               href 
"https://en.wikipedia.org/wiki/Dog" 

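Extracted attributes are often most useful next to the text of the element they belong to. The following is a minimal sketch of collecting both in a data frame; the inline HTML and the title attribute are made up for illustration.

```r
library(rvest)

html <- read_html("<html><body>
  <p class='author'><a href='https://www.johannesbgruber.eu/'>Me</a></p>
  <p>See <a href='https://en.wikipedia.org/wiki/Dog' title='Dog article'>this page</a>.</p>
</body></html>")

links <- html |> html_elements("a")

# one row per link: the visible text plus selected attributes
link_data <- data.frame(
  text = html_text(links),
  url = html_attr(links, "href"),
  title = html_attr(links, "title")  # NA where a link has no title attribute
)
link_data
```

Because html_attr() returns NA for elements that lack the attribute, the columns stay aligned even when the attributes differ between elements.
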
Chaining selectors

If more than one element fits your selector but you only want one of them, try making the selection more specific by chaining selectors with > (matches only direct children) or a space (matches descendants at any depth):

read_html("data/example.html") |> 
  html_elements(".author>a") |> 
  html_attr("href")
[1] "https://www.johannesbgruber.eu/"
read_html("data/example.html") |> 
  html_elements(".author a") |> 
  html_attr("href")
[1] "https://www.johannesbgruber.eu/"
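
In the example file both selectors return the same link. The difference only shows when a matching element is nested more deeply, as in this small sketch (the HTML and the example.com URLs are invented for illustration):

```r
library(rvest)

html <- read_html("<html><body>
  <div class='author'>
    <a href='https://example.com/direct'>direct child</a>
    <span><a href='https://example.com/nested'>nested deeper</a></span>
  </div>
</body></html>")

# '>' only matches <a> elements that are direct children of .author
child_links <- html |> html_elements(".author > a") |> html_attr("href")

# a space matches <a> elements at any depth inside .author
descendant_links <- html |> html_elements(".author a") |> html_attr("href")
```

Here child_links contains only the first URL, while descendant_links contains both.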

Common Selectors

There are quite a lot of CSS selectors, but often you can stick to just a few:

selector        example           selects
element/tag     table             all <table> elements
class           .someTable        all elements with class="someTable"
id              #table-1          the unique element with id="table-1"
element.class   tr.headerRow      all <tr> elements with the headerRow class
class1.class2   .someTable.blue   all elements with both the someTable AND blue class
class1 > tag    .table-1 > tr     all <tr> elements whose parent has the table-1 class
class1 + tag    .top-row + tr     the <tr> element immediately following an element with the top-row class
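
To get a feel for these selectors, it can help to count how many elements each one matches on a small page. The sketch below uses an invented HTML snippet and a hypothetical helper n_matches(); the class and id names mirror the table above but are otherwise arbitrary.

```r
library(rvest)

html <- read_html("<html><body>
  <table id='table-1' class='someTable blue'>
    <tr class='headerRow'><th>Name</th><th>Age</th></tr>
    <tr><td>John</td><td>25</td></tr>
    <tr><td>Mary</td><td>26</td></tr>
  </table>
  <table class='someTable'>
    <tr><td>other</td></tr>
  </table>
</body></html>")

# hypothetical helper: number of elements a selector matches on this page
n_matches <- function(selector) length(html_elements(html, selector))

n_matches("table")            # element/tag
n_matches(".someTable")       # class
n_matches("#table-1")         # id
n_matches("tr.headerRow")     # element with a class
n_matches(".someTable.blue")  # both classes at once
n_matches("#table-1 > tr")    # direct children of the selected element
n_matches(".headerRow + tr")  # the row right after the header row
```

Changing the snippet and predicting the counts before running the code is a quick way to practise.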

Exercises 2

  1. Add another image and another paragraph to data/example.html and display it in your browser
  2. Add a second level headline to the page
  3. Add another image to the page
  4. Manipulate the files data/example.html and/or data/example.css so that “content” is displayed in italics
  5. Practice finding the right selector with the CSS Diner game (https://flukeout.github.io/)
  6. Consider the toy HTML example below. Which selectors do you need to put into html_elements() (which extracts all elements matching the selector) to extract the information described in each comment?
library(rvest)
webpage <- "<html>
<body>
  <h1>Computational Research in the Post-API Age</h1>
  <div class='author'>Deen Freelon</div>
  <div>Keywords:
    <ul>
      <li>API</li>
      <li>computational</li>
      <li>Facebook</li>
    </ul>
  </div>
  <div class='text'>
    <p>Three pieces of advice on whether and how to scrape from Deen Freelon</p>
  </div>
  
  <ol class='advice'>
    <li id='one'> use authorized methods whenever possible </li>
    <li id='two'> do not confuse terms of service compliance with data protection </li>
    <li id='three'> understand the risks of violating terms of service </li>
  </ol>

</body>
</html>" |> 
  read_html()
# the headline
headline <- html_elements(webpage, "")
# the author
author <- html_elements(webpage, "")
# the ordered list
ordered_list <- html_elements(webpage, "")
# all bullet points
bullet_points <- html_elements(webpage, "")
# bullet points in unordered list
bullet_points_unordered <- html_elements(webpage, "")
# bullet points in ordered list
bullet_points_ordered <- html_elements(webpage, "")
# third bullet point in ordered list
bullet_point_three_ordered <- html_elements(webpage, "")

Homework

You did not come to class to just scrape exercise pages. You probably had some initial data and/or research question in mind. Please write a short abstract (~200-400 words) on what you want to accomplish with the web scraping skills you will learn here, so we can try to incorporate the necessary tools in one of the sessions this week. The abstract should include what data can be found on the website and what potential research questions you have in mind.

Deadline: Today midnight!

Wrap Up

Save some information about the session for reproducibility.

sessionInfo()
R version 4.3.1 (2023-06-16)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: EndeavourOS

Matrix products: default
BLAS:   /usr/lib/libblas.so.3.11.0 
LAPACK: /usr/lib/liblapack.so.3.11.0

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=nl_NL.UTF-8        LC_COLLATE=en_GB.UTF-8    
 [5] LC_MONETARY=nl_NL.UTF-8    LC_MESSAGES=en_GB.UTF-8   
 [7] LC_PAPER=nl_NL.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=nl_NL.UTF-8 LC_IDENTIFICATION=C       

time zone: Europe/Amsterdam
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] rvest_1.0.3

loaded via a namespace (and not attached):
 [1] vctrs_0.6.3       httr_1.4.6        cli_3.6.1         knitr_1.43       
 [5] rlang_1.1.1       xfun_0.39         stringi_1.7.12    jsonlite_1.8.7   
 [9] glue_1.6.2        selectr_0.4-2     htmltools_0.5.5   fansi_1.0.4      
[13] rmarkdown_2.23    evaluate_0.21     tibble_3.2.1      fastmap_1.1.1    
[17] yaml_2.3.7        lifecycle_1.0.3   stringr_1.5.0     compiler_4.3.1   
[21] codetools_0.2-19  pkgconfig_2.0.3   rstudioapi_0.15.0 digest_0.6.33    
[25] R6_2.5.1          utf8_1.2.3        pillar_1.9.0      magrittr_2.0.3   
[29] tools_4.3.1       xml2_1.3.5       

References

Clarke, Isobelle, and Jack Grieve. 2019. “Stylistic Variation on the Donald Trump Twitter Account: A Linguistic Analysis of Tweets Posted Between 2009 and 2018.” Edited by Christopher M. Danforth. PLOS ONE 14 (9). https://doi.org/10.1371/journal.pone.0222062.
Freelon, Deen. 2018. “Computational Research in the Post-API Age.” Political Communication 35 (4): 665–68. https://doi.org/10.1080/10584609.2018.1477506.
Luscombe, Alex, Kevin Dick, and Kevin Walby. 2022. “Algorithmic Thinking in the Public Interest: Navigating Technical, Legal, and Ethical Hurdles to Web Scraping in the Social Sciences.” Quality & Quantity 56 (3): 1023–44. https://doi.org/10.1007/s11135-021-01164-0.